Contents

0.1 Introduction

In this session, we will reproduce some of the same analyses presented during Hands On Sessions 1 and 2 using the R programming language. This will include:

This tutorial is based on the following resources:

0.2 Required packages

library(Seurat)
library(SeuratData)
library(SeuratDisk)
library(SingleCellExperiment)

0.3 Data

The dataset that we will work with, generated by Setty et al. includes bone marrow stem/progenitor CD34+ cells from healthy donors, sequenced using the Chromium Single Cell 3’ Library preparation kit. Although the full dataset includes cells from three donors, we will focus on replicate 1 (male, age 35).

All data and data-related files will be avaiable under the following directory:

data_dir <- c("HandsOn3_SingleCell_Data_QC/")

0.4 Getting help with R

All R programming is based on functions, each of which have their own arguments and inputs. If you are ever doubtful as to how to use a function, you may access function documentation using the ? operator.

For instance, you may look at the usage of the Read10X() function from the Seurat R package, which will be used in the next section to load our single-cell expression matrix into R:

?Read10X

0.5 Data loading and object creation

There are two main families of R packages for single-cell data analyses, each of which use a different class of object to store and interact with the data:

While this course is mainly based on Seurat for single-cell data analysis, it is good to be familiar with both types of objects, as there are interesting packages and analyses outside the Seurat environment.

0.5.1 SeuratObject

Single-cell data can be provided in a number of formats, one of which is the sparse matrix format. These can be read into R using the Seurat function mentioned above and supplying the path to the directory containing the files:

setty_rep1_matrix <- Read10X("setty_rep1/")
## Warning in readLines(con = barcode.loc): incomplete final line found on
## 'setty_rep1//barcodes.tsv.gz'

The expression matrix is loaded as a dgCMatrix object, which is a compressed matrix format that saves space by only reporting non-zero elements:

class(setty_rep1_matrix)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
head(setty_rep1_matrix[c(1:10),c(1:10)])
## 6 x 10 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 10 column names 'AAACCTGAGAAACGCC', 'AAACCTGAGCAATCTC', 'AAACCTGCAAGCGAGT' ... ]]
##                                
## DDX11L2     . . . . . . . . . .
## DDX11L1     . . . . . . . . . .
## WASH7P      . . . . . . . . . .
## MIR6859-1   . . . . . . . . . .
## MIR1302-2HG . . . . . . . . . .
## MIR1302-2   . . . . . . . . . .

Note that the matrix is still formatted as usual: with columns being cell IDs (typically encoded as cell barcode sequences) and rows representing genes:

head(colnames(setty_rep1_matrix))
## [1] "AAACCTGAGAAACGCC" "AAACCTGAGCAATCTC" "AAACCTGCAAGCGAGT" "AAACCTGCACTTAAGC"
## [5] "AAACCTGGTAAGTAGT" "AAACCTGGTACCCAAT"
head(rownames(setty_rep1_matrix))
## [1] "DDX11L2"     "DDX11L1"     "WASH7P"      "MIR6859-1"   "MIR1302-2HG"
## [6] "MIR1302-2"

We can use this matrix to create a SeuratObject to store expression as well as other metrics and metadata. We will use the CreateSeuratObject() function, setting min.cells = 1 to only include genes that are expressed in at least one cell. You may read more about this function accessing the help page (?CreateSeuratObject).

setty_seurat <- CreateSeuratObject(setty_rep1_matrix,
                                   project = "healthyBM",
                                   min.cells = 1)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
setty_seurat
## An object of class Seurat 
## 35102 features across 6381 samples within 1 assay 
## Active assay: RNA (35102 features, 0 variable features)

Observe that the dataset contains 39306 genes (i.e. features) and 6998 cells (i.e. samples). The SeuratObject only contains one assay, named “RNA”. If more modalities or other data types were supplied, they would be stored as additional assays in the same SeuratObject.

The SeuratObject includes multiple slots where different types of information can be stored. The @ operator opens up the different slots in the object, including the assays slot, where the “RNA” assay is contained. Inside it, there other slots featuring assay-specific information. Here’s how the hierarchy works:

setty_seurat@assays
## $RNA
## Assay data with 35102 features for 6381 cells
## First 10 features:
##  ENSG00000238009, CICP27, ENSG00000268903, ENSG00000241860,
## ENSG00000228463, ENSG00000290385, ENSG00000230021, ENSG00000235146,
## MTND1P23, MTND2P28
setty_seurat@assays$RNA
## Assay data with 35102 features for 6381 cells
## First 10 features:
##  ENSG00000238009, CICP27, ENSG00000268903, ENSG00000241860,
## ENSG00000228463, ENSG00000290385, ENSG00000230021, ENSG00000235146,
## MTND1P23, MTND2P28
head(setty_seurat@assays$RNA@counts[c(1:10),c(1:10)])
## 6 x 10 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 10 column names 'AAACCTGAGAAACGCC', 'AAACCTGAGCAATCTC', 'AAACCTGCAAGCGAGT' ... ]]
##                                    
## ENSG00000238009 . . . . . . . . . .
## CICP27          . . . . . . . . . .
## ENSG00000268903 . . . . . . . . . .
## ENSG00000241860 . . . . . . . . . .
## ENSG00000228463 . . . . . . . . . .
## ENSG00000290385 . . . . . . . . . .

The object also contains cell-level information, known as metadata, which can be accessed as follows:

head(setty_seurat[[]])
##                  orig.ident nCount_RNA nFeature_RNA
## AAACCTGAGAAACGCC  healthyBM       6164         2808
## AAACCTGAGCAATCTC  healthyBM       4550         2258
## AAACCTGCAAGCGAGT  healthyBM       8697         3406
## AAACCTGCACTTAAGC  healthyBM       4325         1946
## AAACCTGGTAAGTAGT  healthyBM       2144         1370
## AAACCTGGTACCCAAT  healthyBM      10389         3940

These statistics are calculated by default upon SeuratObject creation, but more columns will be added to this table as we perform additional analyses.

0.5.2 SingleCellExperiment

The loaded dgCMatrix object can be made into a SingleCellExperiment object using the SingleCellExperiment() function. In this case, we will supply a list object in which every element will be one assay. This allows us to set assay names:

setty_sce <- SingleCellExperiment(list(RNA=setty_rep1_matrix))

class(setty_sce)
## [1] "SingleCellExperiment"
## attr(,"package")
## [1] "SingleCellExperiment"
setty_sce
## class: SingleCellExperiment 
## dim: 62754 6381 
## metadata(0):
## assays(1): RNA
## rownames(62754): DDX11L2 DDX11L1 ... U6.36 U1.5
## rowData names(0):
## colnames(6381): AAACCTGAGAAACGCC AAACCTGAGCAATCTC ... TTTGTCATCGCCGTGA
##   TTTGTCATCGTCTGCT
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

While the formatting is different, you can see that the same logic as with the SeuratObject class applies: there are multiple assays and different slots designed to store information related to the single-cell dataset. There is additional information regarding how to operate using this class in the SingleCellExperiment package documentation.

0.5.3 Interoperability between formats

Note that, to allow compatibility between different Bioconductor analysis packages, the Seurat package includes functions to convert to and from SingleCellExperiment objects.

For example, the same SCE object can be generated using the SeuratObject we had already created:

setty_sce <- as.SingleCellExperiment(setty_seurat)

setty_sce
## class: SingleCellExperiment 
## dim: 35102 6381 
## metadata(0):
## assays(2): counts logcounts
## rownames(35102): ENSG00000238009 CICP27 ... ENSG00000278817
##   ENSG00000277196
## rowData names(0):
## colnames(6381): AAACCTGAGAAACGCC AAACCTGAGCAATCTC ... TTTGTCATCGCCGTGA
##   TTTGTCATCGTCTGCT
## colData names(4): orig.ident nCount_RNA nFeature_RNA ident
## reducedDimNames(0):
## mainExpName: RNA
## altExpNames(0):

The function makes a more complete SCE object, including both counts and logcounts assays, as well as the same cell-level metadata that we found in the SeuratObject after creation.

Similarly, a SingleCellExperiment object can be converted to a SeuratObject:

setty_seurat <- as.Seurat(setty_sce)

setty_seurat
## An object of class Seurat 
## 35102 features across 6381 samples within 1 assay 
## Active assay: RNA (35102 features, 0 variable features)

0.5.4 Loading a reduced version of the dataset

In order for this tutorial to run smoothly, we have supplied a reduced version of the Setty et al. dataset featuring a random selection of 3K cells, instead of the ~7K in the original expression matrix. This will prevent memory issues and will allow you to run the analyses in a shorter time span.

The reduced dataset is stored as a h5seurat file, which is a file format designed to store SeuratObjects in an efficient manner. It can be loaded using the LoadH5Seurat() function:

setty_seurat <- LoadH5Seurat("data_preparation/setty_seurat.h5seurat")
## Validating h5Seurat file
## Initializing RNA with data
## Adding counts for RNA
## Adding miscellaneous information for RNA
## Adding command information
## Adding cell-level metadata
## Adding miscellaneous information
## Adding tool-specific results

Note that creating an object with the same name will overwrite the previous SeuratObject. Even though we are doing it on purpose in this case, remember to check for it when you are running your own analyses in R to prevent data loss!

Finally, you may remove unnecessary objects from your R session to free memory before you continue:

rm(setty_sce, setty_rep1_matrix)

0.6 Exploratory analysis: Principal Component Analysis (PCA)

The high number of genes and cells in a single-cell expression matrix makes it difficult to apply visualization techniques that can summarize the trends and structure of the data at hand. Dimensionality reduction is often employed to make this task easier. Among these methods, Principal Component Analysis (PCA) is the most commonly used, mainly due to its high interpretability. This makes it a powerful tool for exploratory analysis and to identify biases in the data.

To run PCA, we first need to scale the data to eliminate biases. While this can be done using other functions, we will use the ScaleData() function in Seurat, which does the following:

This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate. The results of this are stored in setty_seurat[["RNA"]]@scale.data. We will run the scaling for all genes in the data:

# select all gene names
all_genes <- rownames(setty_seurat)

# scale data
setty_seurat <- ScaleData(setty_seurat, features = all_genes)
## Centering and scaling data matrix
# view the first 4 rows and columns of the resulting matrix
setty_seurat[["RNA"]]@scale.data[1:4, 1:4]
##                 GACAGAGCAGGAATGC GCGGGTTGTCGTTGTA AGTGAGGCAGCCTGTG
## ENSG00000238009      -0.02582419      -0.02582419      -0.02582419
## CICP27               -0.03163333      -0.03163333      -0.03163333
## ENSG00000268903      -0.03653311      -0.03653311      -0.03653311
## ENSG00000241860      -0.02582419      -0.02582419      -0.02582419
##                 AGACGTTGTAGGCATG
## ENSG00000238009      -0.02582419
## CICP27               -0.03163333
## ENSG00000268903      -0.03653311
## ENSG00000241860      -0.02582419

Now, we can perform linear dimension reduction using the RunPCA() function in Seurat:

setty_seurat <- RunPCA(setty_seurat, 
                       features = all_genes,
                       verbose = FALSE)

Results will be stored in the setty_seurat@reductions slot:

setty_seurat
## An object of class Seurat 
## 35102 features across 3000 samples within 1 assay 
## Active assay: RNA (35102 features, 0 variable features)
##  1 dimensional reduction calculated: pca
setty_seurat@reductions
## $pca
## A dimensional reduction object with key PC_ 
##  Number of dimensions: 50 
##  Projected dimensional reduction calculated:  FALSE 
##  Jackstraw run: FALSE 
##  Computed using assay: RNA

To plot PCA, we will use the DimPlot() function. This function can be used to plot any reduced dimension representation of the data, so we will need to specify which one we wish to plot:

DimPlot(setty_seurat, reduction = "pca")

Since we have not yet normalized our data, cells are not properly clustered into groups -quite contrarily, they appear together and spread across the diagonal between PC1 and PC2, suggesting that there are some sources of unwanted variation that we will need to remove before we continue.


0.7 Quality control

0.7.1 Cell quality control

Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. The number of genes and counts per cell, which are standard metrics in the field, are already computed by Seurat and stored in the metadata slot:

head(setty_seurat[[]])
##                  orig.ident nCount_RNA nFeature_RNA
## GACAGAGCAGGAATGC  healthyBM       4160         1855
## GCGGGTTGTCGTTGTA  healthyBM       3662         1991
## AGTGAGGCAGCCTGTG  healthyBM       4418         2179
## AGACGTTGTAGGCATG  healthyBM       4419         2089
## GAGTCCGTCAGTGTTG  healthyBM       6885         2836
## GCAGTTATCCTGTACC  healthyBM       3646         2110

To add to this, we will compute the percentage of reads mapping to mitochondrial genes using the PercentageFeatureSet() function. We will add this information to a new column called percent.mt using the [[ operator, which adds columns to metadata. All mitochondrial gene IDs start with “MT”, which makes it easy to select the genes of interest:

setty_seurat[["percent.mt"]] <- PercentageFeatureSet(setty_seurat, 
                                                     pattern = "^MT-")

We can visualize several QC metrics using the VlnPlot() function:

VlnPlot(setty_seurat, 
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), 
        ncol = 3)

We can see that there are a few outlier cells with high numbers of counts/genes and high mitochondrial gene counts.

To better understand the relationship between the QC metrics, the FeatureScatter() function can be used to visualize feature-feature relationships. We will plot the metrics to find thresholds for data filtering:

FeatureScatter(setty_seurat, feature1 = "nCount_RNA", feature2 = "percent.mt")

FeatureScatter(setty_seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

Observe the following trends in these plots: - nCount_RNA vs percent.mt: there are a number of cells for which the % of mitochondrial counts is high, whereas the number of total counts is low. These are likely to be dying cells. - nCount_RNA vs nFeature_RNA: some cells show a high number of counts as well as a high number of features (genes). These are likely to be doublets.

Based on the QC plots, we will filter cells with >10% mitochondrial counts and >6000 and <500 genes. This can be achieved using the subset() function in Seurat:

setty_seurat <- subset(setty_seurat, 
                      subset = nFeature_RNA > 500 & 
                      nFeature_RNA < 6000 &
                      percent.mt < 10)

setty_seurat
## An object of class Seurat 
## 35102 features across 2983 samples within 1 assay 
## Active assay: RNA (35102 features, 0 variable features)
##  1 dimensional reduction calculated: pca

As a result, we are left with 2864 high-quality cells for downstream analysis.

0.7.2 Feature quality control

Feature QC is not included in the standard Seurat pipeline -instead, the tool is designed to preserve all genes in the expression matrix, filtering them only upon SeuratObject creation, as was demonstrated before. However, it is often better to to remove genes after cell QC, given that the main filtering criteria is the frequency of non-zero expression across the dataset.

We will calculate the percentage of cells in which is gene is expressed to determine whether the genes should remain in the dataset. To do this, we will need to extract the count matrix from the SeuratObject:

counts <- GetAssayData(setty_seurat,
                       assay = "RNA", slot = "counts")

total_expr <- rowSums(counts > 0)

head(total_expr)
## ENSG00000238009          CICP27 ENSG00000268903 ENSG00000241860 ENSG00000228463 
##               2               3               4               2               4 
## ENSG00000290385 
##               2

Briefly, counts>0 returns a matrix where each entry is TRUE/FALSE if that entry of the counts matrix exceeds 0. Performing rowSums() on that matrix gives the total # of cells in which expression is non-zero for each gene.

We will keep genes that are expressed in at least 3 cells, which constitutes 1% of our current dataset. We will filter the total_expr vector and use the names() function to extract the IDs of our genes of interest:

keep_genes <- names(total_expr[total_expr > 3])

length(keep_genes)
## [1] 25228
head(keep_genes)
## [1] "ENSG00000268903" "ENSG00000228463" "ENSG00000230021" "MTND1P23"       
## [5] "MTND2P28"        "MTCO1P12"

The subset() function we used above can filter genes based on a list of IDs supplied to the features argument:

setty_seurat <- subset(setty_seurat, 
                       features = keep_genes)

setty_seurat
## An object of class Seurat 
## 25228 features across 2983 samples within 1 assay 
## Active assay: RNA (25228 features, 0 variable features)
##  1 dimensional reduction calculated: pca

As a result, 26551 genes remain in the dataset.


0.8 Normalization and feature selection

0.8.1 Normalization

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, Seurat employs a global-scaling normalization method, “LogNormalize”. This method normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. While there are other, more complex normalization strategies, this will suffice to remove library size biases from our data.

Normalized values are stored in setty_seurat[["RNA"]]@data, whereas the raw expression matrix is preserved in the setty_seurat[["RNA"]]@counts slot.

setty_seurat <- NormalizeData(setty_seurat, 
                             normalization.method = "LogNormalize")

setty_seurat[["RNA"]]@data[1:4, 1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
##                 GACAGAGCAGGAATGC GCGGGTTGTCGTTGTA AGTGAGGCAGCCTGTG
## ENSG00000268903          .               .                .       
## ENSG00000228463          .               .                .       
## ENSG00000230021          .               .                .       
## MTND1P23                 3.00789         1.866092         2.824224
##                 AGACGTTGTAGGCATG
## ENSG00000268903                .
## ENSG00000228463                .
## ENSG00000230021                .
## MTND1P23                       .

0.8.2 Feature selection

Next, we will calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). It has been described that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

In Seurat, this is done using the FindVariableFeatures() function, which stores results in the setty_seurat[["RNA"]]@var.features slot:

setty_seurat <- FindVariableFeatures(setty_seurat)

# view top 10
head(setty_seurat[["RNA"]]@var.features, n = 10)
##  [1] "HBB"    "CLC"    "LYZ"    "ELANE"  "LTBP1"  "AHSP"   "PRTN3"  "H4C3"  
##  [9] "S100A8" "LTB"

Seurat also includes a function, VariableFeaturePlot() to plot variable features in order to better understand their properties:

VariableFeaturePlot(setty_seurat)

These features will be used in downstream analysis, including PCA. We run a previous PCA with all genes for exploratory purposes, however, it is generally more efficient to select a set of highly variable features, as they capture the majority of data variation. Note that data will need to be re-scaled using the normalized counts instead of the raw counts:

setty_seurat <- ScaleData(setty_seurat, 
                          features = all_genes)
## Centering and scaling data matrix
setty_seurat <- RunPCA(setty_seurat, 
                       features = VariableFeatures(setty_seurat),
                       verbose = FALSE)

DimPlot(setty_seurat)

0.9 Cell cycle scoring

In addition to normalization, we may want to eliminate other effects that may interfere with data analysis. Among them, the cell cycle phase of each cell is known to be a major source of variation in single-cell data.

To do this, each cell is assigned a score based on its expression of G2/M and S phase markers. These markers must be known previously and supplied by the user. Luckily, Seurat provides a list of genes for each phase in the cc.genes.updated.2019 object, which is loaded together with the package:

cc.genes.updated.2019
## $s.genes
##  [1] "MCM5"     "PCNA"     "TYMS"     "FEN1"     "MCM7"     "MCM4"    
##  [7] "RRM1"     "UNG"      "GINS2"    "MCM6"     "CDCA7"    "DTL"     
## [13] "PRIM1"    "UHRF1"    "CENPU"    "HELLS"    "RFC2"     "POLR1B"  
## [19] "NASP"     "RAD51AP1" "GMNN"     "WDR76"    "SLBP"     "CCNE2"   
## [25] "UBR7"     "POLD3"    "MSH2"     "ATAD2"    "RAD51"    "RRM2"    
## [31] "CDC45"    "CDC6"     "EXO1"     "TIPIN"    "DSCC1"    "BLM"     
## [37] "CASP8AP2" "USP1"     "CLSPN"    "POLA1"    "CHAF1B"   "MRPL36"  
## [43] "E2F8"    
## 
## $g2m.genes
##  [1] "HMGB2"   "CDK1"    "NUSAP1"  "UBE2C"   "BIRC5"   "TPX2"    "TOP2A"  
##  [8] "NDC80"   "CKS2"    "NUF2"    "CKS1B"   "MKI67"   "TMPO"    "CENPF"  
## [15] "TACC3"   "PIMREG"  "SMC4"    "CCNB2"   "CKAP2L"  "CKAP2"   "AURKB"  
## [22] "BUB1"    "KIF11"   "ANP32E"  "TUBB4B"  "GTSE1"   "KIF20B"  "HJURP"  
## [29] "CDCA3"   "JPT1"    "CDC20"   "TTK"     "CDC25C"  "KIF2C"   "RANGAP1"
## [36] "NCAPD2"  "DLGAP5"  "CDCA2"   "CDCA8"   "ECT2"    "KIF23"   "HMMR"   
## [43] "AURKA"   "PSRC1"   "ANLN"    "LBR"     "CKAP5"   "CENPE"   "CTCF"   
## [50] "NEK2"    "G2E3"    "GAS2L3"  "CBX5"    "CENPA"

Scores can be assigned in Seurat using the CellCycleScoring() function, which stores S and G2/M scores in SeuratObject meta data, along with the predicted classification of each cell in either G2M, S or G1 phase:

s_genes <- cc.genes.updated.2019$s.genes
g2m_genes <- cc.genes.updated.2019$s.genes

setty_seurat <- CellCycleScoring(setty_seurat, 
                                 s.features = s_genes,
                                 g2m.features = g2m_genes)
## Warning: The following features are not present in the object: FEN1, not
## searching for symbol synonyms

## Warning: The following features are not present in the object: FEN1, not
## searching for symbol synonyms
head(setty_seurat[[]])
##                  orig.ident nCount_RNA nFeature_RNA percent.mt     S.Score
## GACAGAGCAGGAATGC  healthyBM       4157         1852   4.230769 -0.08714211
## GCGGGTTGTCGTTGTA  healthyBM       3661         1990   2.676133  0.77462014
## AGTGAGGCAGCCTGTG  healthyBM       4417         2178   3.598914 -0.19621708
## AGACGTTGTAGGCATG  healthyBM       4415         2085   3.077619 -0.20911893
## GAGTCCGTCAGTGTTG  healthyBM       6881         2832   3.006536 -0.13072768
## GCAGTTATCCTGTACC  healthyBM       3643         2107   2.797586 -0.18631346
##                    G2M.Score Phase
## GACAGAGCAGGAATGC -0.08991974    G1
## GCGGGTTGTCGTTGTA  0.74176489     S
## AGTGAGGCAGCCTGTG -0.21445121    G1
## AGACGTTGTAGGCATG -0.24489098    G1
## GAGTCCGTCAGTGTTG -0.12892039    G1
## GCAGTTATCCTGTACC -0.18573606    G1

Running a PCA using only the S and G2/M phase genes (instead of the highly variable genes, as we did before) reveals the extent to which the data is affected by this source of variability:

setty_seurat <- RunPCA(setty_seurat, 
                       features = c(s_genes, g2m_genes),
                       verbose = FALSE)
DimPlot(setty_seurat, group.by = "Phase")

In this PCA plot, we observe that the cells are clearly separated by cell cycle phase, particularly along PC1, which spreads S and G1 cells. This signal can be removed from the data using the ScaleData() function, supplying cell cycle scores to the vars.to.regress argument:

setty_seurat <- ScaleData(setty_seurat, 
                          vars.to.regress = c("S.Score", "G2M.Score"),
                          features = all_genes)

Since this steps takes a long time to run, we will instead load a SeuratObject that contains the pre-computed results of the cell cycle correction:

setty_seurat <- LoadH5Seurat("processed_data/setty_seurat_cc-regressed.h5seurat")
## Validating h5Seurat file
## Initializing RNA with data
## Adding counts for RNA
## Adding scale.data for RNA
## Adding feature-level metadata for RNA
## Adding variable feature information for RNA
## Adding miscellaneous information for RNA
## Adding reduction pca
## Adding cell embeddings for pca
## Adding feature loadings for pca
## Adding miscellaneous information for pca
## Adding command information
## Adding cell-level metadata
## Adding miscellaneous information
## Adding tool-specific results

The corrected counts are stored in the setty_seurat[["RNA"]]@scale.data slot.

The PCA plot now reveals that the cell cycle effect observed on the cell cycle-associated genes no longer exists:

setty_seurat <- RunPCA(setty_seurat, 
                       features = c(s_genes, g2m_genes),
                       verbose = FALSE)

DimPlot(setty_seurat, group.by = "Phase")

Observe how the PCA comptued with VariableFeatures() also reflects this, with cells not being separated by cell cycle phase.

setty_seurat <- RunPCA(setty_seurat, 
                       features = VariableFeatures(setty_seurat),
                       verbose = FALSE)

DimPlot(setty_seurat, group.by = "Phase")


0.10 Cell clustering

0.10.1 Determining the dimensionality of the data

To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on PCA results. The top principal components represent the majority of the variation in the data, and therefore constitute a robust compression of the dataset. However, how many components should we choose to include?

A common method to select the number of relevant PCs is ranking them based on the percentage of variance explained by each one, which is known as an elbow plot and can be generated by Seurat using the ElbowPlot() function:

ElbowPlot(setty_seurat)

The “elbow” will be the point in the plot where the percentage of explained variance starts to flatten, and in which the majority of the signal in the data will already be captured. In this case, the elbow seems to be situated around 6 PCs, but anything up to 10 PCs might be a good choice, since there is a slight plateau after PC 6. We will select 10 PCs for clustering.

0.10.2 Clustering the cells

To find cell clusters, Seurat first computes the “distances” (i.e. the similarity) among cells in the dataset using the FindNeighbors() function and the number of selected PCs, supplied to the dims argument. Then, groups of cells that lay very close together are detected using the FindClusters() function, which requires users to define a resolution parameter. Higher resolution values will lead to producing a larger number of clusters. For our 3K cell dataset, we will set a resolution value of 0.5:

setty_seurat <- FindNeighbors(setty_seurat, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
setty_seurat <- FindClusters(setty_seurat, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2983
## Number of edges: 95351
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8756
## Number of communities: 10
## Elapsed time: 0 seconds

Seurat generated 11 cluters. The clusters can be found using the Idents() function, however, identities are also stored in the seurat_clusters column of the metadata table:

head(Idents(setty_seurat))
## GACAGAGCAGGAATGC GCGGGTTGTCGTTGTA AGTGAGGCAGCCTGTG AGACGTTGTAGGCATG 
##                6                3                0                0 
## GAGTCCGTCAGTGTTG GCAGTTATCCTGTACC 
##                7                2 
## Levels: 0 1 2 3 4 5 6 7 8 9
table(setty_seurat$seurat_clusters)
## 
##   0   1   2   3   4   5   6   7   8   9 
## 592 519 394 344 276 233 220 179 163  63

Let’s see how these clusters are distributed in the PCA plot,

DimPlot(setty_seurat, 
        reduction = "pca",
        group.by = "seurat_clusters",
        label = TRUE)

We observe that we generated several large groups that are clearly different from each other, while some of the smaller clusters are more similar, e.g. clusters 0 and 3. There are also some groups that may not be as clearly defined, such as cluster 8. This type of plot may serve to inform the clustering and re-run it with more appropriate parameters, such as different number of PCs, smaller/larger resolution, etc.

We can also generate a tSNE plot of the clustered cells using the PCs that we used for clustering:

setty_seurat <- RunTSNE(setty_seurat, 
                        dims = 1:10)

DimPlot(setty_seurat, 
        reduction = "tsne",
        label = TRUE)

NOTE: distances between clusters in the tSNE plot do not represent the real similarity between the clusters.

0.10.3 Clustering quality control

Although the groups seem fairly well-defined, it is always interesting to run some quality control analyses to verify that we have correctly clustered the data.

One good way to see whether unwanted factors are influencing the clustering is to check the distribution of QC metrics, such as the number of reads, genes and the proportion of mitochondrial expression. Some of the trends might be biological, but others could be coming from the noise in the data, and it is good to be aware of them in downstream analysis.

VlnPlot(setty_seurat, features = "nCount_RNA")

VlnPlot(setty_seurat, features = "nFeature_RNA")

VlnPlot(setty_seurat, features = "percent.mt")

VlnPlot(setty_seurat, features = "S.Score")

VlnPlot(setty_seurat, features = "G2M.Score")

0.11 Save data for hands-on session 4

# remove unnecessary objects
rm(counts, g2m_genes, s_genes, keep_genes, all_genes, total_expr)

# save seurat object with clustering results
SaveH5Seurat(setty_seurat,
             filename = "processed_data/setty_seurat_clustered.h5seurat", overwrite = T)
## Warning: Overwriting previous file
## processed_data/setty_seurat_clustered.h5seurat
## Creating h5Seurat file for version 3.1.5.9900
## Adding counts for RNA
## Adding data for RNA
## Adding scale.data for RNA
## Adding variable features for RNA
## Adding feature-level metadata for RNA
## Adding cell embeddings for pca
## Adding loadings for pca
## No projected loadings for pca
## Adding standard deviations for pca
## No JackStraw data for pca
## Adding cell embeddings for tsne
## No loadings for tsne
## No projected loadings for tsne
## No standard deviations for tsne
## No JackStraw data for tsne